Online Learning with Ensembles 



R. Urbanczik 
Neural Computing Research Group 
Aston University 
Aston Triangle, Birmingham B4 7ET, U.K. 

February 1, 2008 

Abstract 

Supervised online learning with an ensemble of students randomized
by the choice of initial conditions is analyzed. For the case of the
perceptron learning rule, asymptotically the same improvement in the
generalization error of the ensemble compared to the performance of
a single student is found as in Gibbs learning. For more optimized
learning rules, however, using an ensemble yields no improvement.
This is explained by showing that for any learning rule $f$ a transform
$\bar{f}$ exists, such that a single student using $\bar{f}$ has the same
generalization behaviour as an ensemble of $f$-students.

Online learning, where each training example is presented just once to
the student, has proved to be a very successful paradigm in the study of
neural networks using methods from statistical mechanics [?]. On the one
hand, it makes it possible to rigorously [?] analyze a wide range of learning
algorithms. On the other hand, online algorithms can in some cases yield
a performance which equals that of the Bayes optimal inference procedure,
e.g. asymptotically, when the probability of the data is a smooth function of
the parameters of the network [?].

Some problems, however, do remain. For nonsmooth cases, which arise
e.g. in classification tasks, the Bayes optimal procedure yields a superior
generalization performance, even asymptotically, to that of online algorithms
[?]. Also, even for smooth problems, the online dynamics often has suboptimal
stationary points arising from symmetries in the network architecture.
Then the sample size needed to reach the asymptotic regime will scale faster
than linearly with the number of free parameters if no prior knowledge is
built into the initial conditions of the dynamics [?].

It thus seems of interest to ask which extensions of the online framework 
make sense. Here we shall consider using an ensemble of students randomized 
by the choice of initial condition and classifying a new input by a majority 
vote. This may be motivated by the fact that in the batch case the Bayes 
optimal inference procedure can be implemented by an ensemble picked from 
the posterior (given the training set) distribution on the set of all students. 
We shall consider an ensemble of $K$ students; at time step $\mu$ the $i$-th
student is characterized by a weight vector $J_i^\mu \in \mathbb{R}^N$. The
learning dynamics is based on a training set of $\alpha N$ input/output pairs
$(\xi^\mu, \tau^\mu)$ where $\xi^\mu \in \mathbb{R}^N$. We shall consider
realizable learning in a perceptron, so $\tau^\mu = \mathrm{sign}(B^T \xi^\mu)$
where $B$ is the $N$-dimensional weight vector defining the teacher and it is
convenient to assume that $|B| = 1$ holds for the Euclidean norm. The dynamics
of the $i$-th student then takes the form

$$ J_i^{\mu+1} = J_i^\mu + \xi^\mu N^{-1} f\big(\mu/N,\, |J_i^\mu|,\, B^T \xi^\mu,\, (J_i^\mu)^T \xi^\mu\big) \qquad (1) $$

and the choice of the real valued function $f$ defines the learning rule.
Reasonably, $f$ may only depend on the third argument $B^T \xi^\mu$ via its
sign $\tau^\mu$, but it is not helpful to make this explicit in the notation.
Note that all of the
members of the ensemble learn from the same training examples, and these 
are presented in the same order. 
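
The update (1) can be sketched numerically as follows. This is a minimal illustration, not the author's code: the function names, the default sizes, and the choice of a hard perceptron rule with a $|J|$ prefactor as the example $f$ are all assumptions made for the sake of the example.

```python
import numpy as np

def perceptron_rule(alpha, norm_J, y, x, eta=1.0):
    """Example choice of f (an assumption): hard perceptron rule.

    Updates only students whose field x disagrees in sign with the teacher.
    """
    tau = np.sign(y)                       # teacher output
    return eta * norm_J * tau * (tau * x < 0)

def ensemble_step(J, B, xi, mu, rule):
    """One step of (1), applied to all K students with the same example.

    J  : (K, N) array, one weight vector per student
    B  : (N,) teacher vector with |B| = 1
    xi : (N,) training input shared by every student
    """
    N = B.shape[0]
    y = B @ xi                             # teacher field B^T xi
    x = J @ xi                             # student fields, shape (K,)
    norms = np.linalg.norm(J, axis=1)
    f = rule(mu / N, norms, y, x)          # one scalar per student
    return J + np.outer(f, xi) / N

def run_ensemble(K=5, N=200, alpha=10.0, seed=0, rule=perceptron_rule):
    """Train K students on the same alpha*N examples in the same order."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal(N)
    B /= np.linalg.norm(B)
    # Independent random initial conditions on the unit sphere (P(0) = 1).
    J = rng.standard_normal((K, N))
    J /= np.linalg.norm(J, axis=1, keepdims=True)
    for mu in range(int(alpha * N)):
        xi = rng.standard_normal(N)
        J = ensemble_step(J, B, xi, mu, rule)
    return J, B
```

The only source of diversity between students here is the initial condition, since every call to `ensemble_step` feeds the same input to all rows of `J`.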

Assuming that the components of the example inputs are independent
random variables picked from the normal distribution on $\mathbb{R}$, the
state of the ensemble can be described by the order parameters
$R_i(\alpha) = B^T J_i^{\alpha N}$ and
$Q_{ij}(\alpha) = (J_i^{\alpha N})^T J_j^{\alpha N}$. For a reasonable choice
of $f$ the order parameters will be nonfluctuating for large $N$ and satisfy
the differential equations:

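$$ \dot{R}_i = \langle y f_i \rangle_{x_i, y}, \qquad
   \dot{Q}_{ij} = \langle x_i f_j + x_j f_i + f_i f_j \rangle_{x_i, x_j, y}, $$

$$ f_i = f(\alpha, Q_{ii}^{1/2}, y, x_i), \qquad (2) $$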

where $y$ and the $x_i$ are zero mean Gaussian random variables with
covariances $\langle x_i y \rangle = R_i$ and $\langle x_i x_j \rangle = Q_{ij}$.
We shall only consider the case where the initial values $J_i^0$ are picked
independently from the uniform distribution on a sphere with radius
$P(0)^{1/2}$. Then for large $N$ the initial conditions for (2) are
$R_i(0) = Q_{ij}(0) = 0$ for $i \neq j$ and $Q_{ii}(0) = P(0)$. These
conditions are invariant under permutations of the site indices $i$ and this
also holds for the system of differential equations (2). Thus this site
symmetry will be preserved for all times and we need only consider the three
order parameters $R(\alpha) = R_i(\alpha)$, $P(\alpha) = Q_{ii}(\alpha)$ and
$Q(\alpha) = Q_{ij}(\alpha)$ for $i \neq j$. Since the length of the students
is of little interest, it will often be convenient to consider the normalized
overlaps $r(\alpha) = R(\alpha)/\sqrt{P(\alpha)}$ and
$q(\alpha) = Q(\alpha)/P(\alpha)$.

A new input $\xi$, picked from the same distribution as the training inputs,
will be classified by the ensemble using a majority vote, that is by:

$$ \sigma(\xi) = \mathrm{sign}\Big( \sum_{i=1}^{K} \mathrm{sign}\big( (J_i^{\alpha N})^T \xi \big) \Big). \qquad (3) $$

As an alternative to using a majority vote, one might consider constructing a
new classifier by averaging the weight vectors of the students, setting
$\bar{J}^{\alpha N} = K^{-1} \sum_i J_i^{\alpha N}$. As in Gibbs theory [?] a
simple application of the law of large numbers yields that the two classifiers
are equivalent in the large $K$ limit if $q(\alpha) = O(1)$, that is
$\sigma(\xi) = \mathrm{sign}\big( (\bar{J}^{\alpha N})^T \xi \big)$ for almost
all inputs. In the sequel we shall only consider the large $K$ limit, assuming
that $K \ll N$ so that the fluctuations in the site symmetry of the initial
conditions can be ignored. The generalization error $\epsilon_e$ of the
ensemble, that is the probability of misclassifying $\xi$, is then given by
$\epsilon_e = \epsilon\big( r(\alpha)/\sqrt{q(\alpha)} \big)$ where

$$ \epsilon(x) = \frac{1}{\pi} \arccos x. \qquad (4) $$

Similarly, the generalization error of a single student in the ensemble is
$\epsilon_s = \epsilon(r(\alpha))$.
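
The two classifiers and the error formula (4) translate directly into code. The sketch below is illustrative only; the function names are hypothetical, and the clipping inside `eps` is an added numerical safeguard rather than part of the formula.

```python
import numpy as np

def eps(x):
    """Generalization error (4) for a normalized teacher overlap x."""
    return np.arccos(np.clip(x, -1.0, 1.0)) / np.pi

def single_student_error(r):
    """Error of one ensemble member with normalized overlap r."""
    return eps(r)

def ensemble_error(r, q):
    """Large-K ensemble error eps(r / sqrt(q)); requires q > 0."""
    return eps(r / np.sqrt(q))

def majority_vote(J, xi):
    """Classifier (3): sign of the summed student outputs; J is (K, N)."""
    return np.sign(np.sum(np.sign(J @ xi)))

def averaged_vote(J, xi):
    """Classifier built from the averaged weight vector."""
    return np.sign(J.mean(axis=0) @ xi)
```

For $q < 1$ the argument $r/\sqrt{q}$ exceeds $r$, so `ensemble_error(r, q)` lies below `single_student_error(r)`; for $q = 1$ the two coincide.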

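We shall first consider a soft version of the perceptron learning rule:

$$ f\big(\alpha, |J_i^\mu|, B^T \xi^\mu, (J_i^\mu)^T \xi^\mu\big) = \eta\, |J_i^\mu|\, \tau^\mu\, H\!\left( \frac{k}{\sqrt{1 - k^2}}\, \frac{\tau^\mu (J_i^\mu)^T \xi^\mu}{|J_i^\mu|} \right), \qquad (5) $$

where $H(x) = \frac{1}{2} \mathrm{erfc}(x/\sqrt{2})$ and $\eta$ is a time
dependent learning rate. For $k = 0$ this reduces to Hebbian learning whereas
$k = 1$ yields the perceptron learning rule. Note, however, that the
$|J_i^\mu|$ prefactor makes the dynamics invariant with respect to the scaling
of the student weight vectors. From (2) one obtains for the order parameters: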

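$$ \dot{r} = \frac{\eta}{\sqrt{2\pi}} (1 - r^2) - \frac{\eta^2 r}{2} \Big( \epsilon(kr) - \tfrac{1}{2} \epsilon(k^2) \Big), $$

$$ \dot{q} = \sqrt{\frac{2}{\pi}}\, \eta\, r\, (1 - q) + \eta^2 \Big( (1 - q)\, \epsilon(kr) - \tfrac{1}{2} \epsilon(k^2 q) + \tfrac{q}{2} \epsilon(k^2) \Big). \qquad (6) $$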

We first consider the perceptron learning rule, i.e. $k = 1$. In the limit
$r, q \to 1$ one finds
$\dot{r} \simeq \eta \sqrt{2/\pi}\,(1 - r) - \eta^2 \epsilon(r)/2$ and
$\dot{q} \simeq \eta \sqrt{2/\pi}\,(1 - q) - \eta^2 \epsilon(q)/2$, that is
$r$ and $q$ satisfy the same differential equation. If the learning rate
schedule is such that this limit is reached, this means that $(1 - r)/(1 - q)$
will approach 1 for large $\alpha$. Hence asymptotically
$\epsilon_e \simeq \epsilon\big(\sqrt{r(\alpha)}\big)$, and the same
improvement by a factor of $1/\sqrt{2}$ in the generalization error of the
ensemble compared to single student performance is found as in Gibbs
learning. (Interestingly the same asymptotic relationship between
$\epsilon_e$ and $\epsilon_s$ also holds for the Adatron learning rule
$f = -\Theta\big(-\tau^\mu (J_i^\mu)^T \xi^\mu\big)\, (J_i^\mu)^T \xi^\mu$.)
The optimal asymptotics of the learning rate schedule is
$\eta \simeq 2\sqrt{2\pi}/\alpha$ and this yields an
$\epsilon_e \simeq 2\sqrt{2}/(\pi\alpha) \approx 0.90/\alpha$ decay of the
ensemble generalization error. This is very close to the $0.88/\alpha$ decay
found for the optimal single student algorithm [?].
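
The factor $1/\sqrt{2}$ follows from $\epsilon_e \simeq \epsilon(\sqrt{r})$ because $\epsilon(x)$ vanishes like $\sqrt{2(1-x)}/\pi$ as $x \to 1$. A quick numerical check, under the assumption that the ensemble prefactor has the closed form $2\sqrt{2}/\pi$ (obtained here by re-deriving the asymptotics, and consistent with the quoted value 0.90):

```python
import math

def eps(x):
    """Generalization error function (4)."""
    return math.acos(x) / math.pi

def improvement_ratio(r):
    """eps_e / eps_s under the asymptotic relation eps_e = eps(sqrt(r))."""
    return eps(math.sqrt(r)) / eps(r)

# Assumed closed form of the ensemble decay prefactor (see lead-in).
ensemble_prefactor = 2.0 * math.sqrt(2.0) / math.pi

for delta in (1e-2, 1e-4, 1e-6):
    # The ratio approaches 1/sqrt(2) as r = 1 - delta tends to 1.
    print(delta, improvement_ratio(1.0 - delta))
```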

We next consider improving the performance by tuning $k$. From (6)
one easily sees that single student performance is optimized when $k = r$.
Asymptotically this may be achieved by setting $k \simeq 1 - 4/\alpha^2$ and
choosing the optimal learning schedule which is asymptotically the same as for
the standard perceptron learning rule. Then already a single student achieves
$\epsilon_s \simeq 2\sqrt{2}/(\pi\alpha)$, that is the same large $\alpha$
behaviour as the ensemble in the $k = 1$ case. Unfortunately $r$ and $q$ now
have different asymptotics and one finds $1 - q \ll 1 - r$. So for all
practical purposes the ensemble collapses to a single point and for large
$\alpha$ to leading order $\epsilon_e \simeq \epsilon_s$.
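
That $k = r$ is the best choice for a single student can be checked by a grid search. The sketch assumes that $k$ enters the $r$ equation of (6) only through the factor $\epsilon(kr) - \epsilon(k^2)/2$ multiplying $\eta^2$, so that maximizing $\dot r$ amounts to minimizing this factor; the function names and the grid resolution are arbitrary.

```python
import numpy as np

def eps(x):
    """Generalization error function (4)."""
    return np.arccos(np.clip(x, -1.0, 1.0)) / np.pi

def update_noise(k, r):
    """k-dependent factor of the eta^2 term in the r equation of (6)."""
    return eps(k * r) - 0.5 * eps(k**2)

def best_k(r, n_grid=1001):
    """Grid search over k in [0, 1] for the value minimizing the noise."""
    ks = np.linspace(0.0, 1.0, n_grid)
    return ks[np.argmin(update_noise(ks, r))]
```

For example, `best_k(0.8)` returns a value within one grid spacing of 0.8.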

It is of course not clear that optimizing single student performance is a
good idea, and we thus analyze more generic schedules, setting
$k \simeq 1 - (\lambda/\alpha)^2$. Figure 1 then, however, shows that the two
cases considered above are optimal for ensemble and single student
performance, respectively.

The above analysis of the soft perceptron rule suggests that while for
some rules using an ensemble does significantly improve on single student
performance, for more optimized rules this may no longer be the case. We
shall now prove that the generalization error of the optimal single student
learning rule is also a lower bound of the ensemble performance for any
learning rule $f$. To achieve this, a learning rule $\bar{f}$ will be given
which for each pattern yields the ensemble average of $f$. Then a single
student $\bar{J}^\mu$ using $\bar{f}$ will have generalization behaviour equal
to that of a large ensemble of students using $f$. The dynamics for
$\bar{J}^\mu$ may be written as

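$$ \bar{J}^{\mu+1} = \bar{J}^\mu + \xi^\mu N^{-1} \bar{f}\big(\mu/N,\, B^T \xi^\mu,\, (\bar{J}^\mu)^T \xi^\mu\big) \qquad (7) $$

where $\bar{f}$ is the following integral transform of $f$:

$$ \bar{f}(\alpha, y, x) = \big\langle f\big(\alpha,\, P(\alpha)^{1/2},\, y,\, x + (P(\alpha) - Q(\alpha))^{1/2} z\big) \big\rangle_z. \qquad (8) $$
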
Here the distribution of $z$ is normal. The entire procedure is quite
intuitive: $\bar{J}^\mu$ represents the center of mass of the ensemble and
$(\bar{J}^\mu)^T \xi^\mu + (P(\alpha) - Q(\alpha))^{1/2} z$ is a guess for the
value of the hidden field $(J_i^\mu)^T \xi^\mu$ of one of the ensemble
members. For large $K$ the distribution of the last two quantities will be the
same, and the ensemble average of $f$ can be reliably predicted. Further, note
that the class of soft perceptron rules (5) is invariant under the integral
transform (8) since $\langle H(a + bz) \rangle_z = H(a/\sqrt{1 + b^2})$. This
explains why optimizing single student and optimizing ensemble performance
within this class yields the same generalization behaviour.

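To demonstrate that $\bar{J}^\mu$ does indeed emulate the large ensemble
consider the order parameters $\bar{R}(\alpha) = B^T \bar{J}^{\alpha N}$ and
$\bar{Q}(\alpha) = |\bar{J}^{\alpha N}|^2$. We shall start with
$\bar{J}^0 = 0$, thus $\bar{R}(0) = R(0) = \bar{Q}(0) = Q(0) = 0$, and it will
suffice to show that the pair $\bar{R}, \bar{Q}$ satisfies the same
differential equation as the pair $R, Q$. From (2) we obtain for $Q$:

$$ \dot{Q} = \big\langle 2 x_i f(\alpha, P(\alpha)^{1/2}, y, x_j) + f(\alpha, P(\alpha)^{1/2}, y, x_i)\, f(\alpha, P(\alpha)^{1/2}, y, x_j) \big\rangle \qquad (9) $$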

where $i$ and $j$ are any two different indices. The Gaussians $x_i$ and $x_j$
may be rewritten in terms of normal random variables $z_i$, $z_j$ and $z$,
independent of each other and of $y$, as

$$ x_i = \sqrt{P - Q}\, z_i + \sqrt{Q - R^2}\, z + R y \quad \text{and} \quad x_j = \sqrt{P - Q}\, z_j + \sqrt{Q - R^2}\, z + R y. \qquad (10) $$
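
That this decomposition reproduces the required second moments can be confirmed with a few lines of linear algebra. The sketch writes $(x_i, x_j, y)$ as a linear map of the independent unit-variance variables $(z_i, z_j, z, y)$ and forms the implied covariance matrix; the function name and the test values of $P$, $Q$, $R$ are arbitrary.

```python
import numpy as np

def field_covariance(P, Q, R):
    """Covariance of (x_i, x_j, y) implied by the decomposition (10)."""
    a = np.sqrt(P - Q)
    b = np.sqrt(Q - R**2)
    # Columns correspond to the independent variables z_i, z_j, z, y.
    A = np.array([[a,   0.0, b,   R],
                  [0.0, a,   b,   R],
                  [0.0, 0.0, 0.0, 1.0]])
    return A @ A.T

cov = field_covariance(P=2.0, Q=1.2, R=0.7)
# Expected: variances P on the diagonal for x_i and x_j, covariance Q
# between them, and covariance R between each field and y.
```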

Carrying out the integrations over $z_i$ and $z_j$ in (9) yields

$$ \dot{Q} = \big\langle 2 x \bar{f}(\alpha, y, x) + \bar{f}(\alpha, y, x)^2 \big\rangle_{y, z}, \qquad (11) $$

where $x = \sqrt{Q - R^2}\, z + R y$. The variance of $x$ is $Q$ and its
covariance with $y$ is $R$. Applying (2) to (7) yields
$\dot{\bar{Q}} = \big\langle 2 \bar{x} \bar{f}(\alpha, y, \bar{x}) + \bar{f}(\alpha, y, \bar{x})^2 \big\rangle_{y, \bar{x}}$,
where the variance of $\bar{x}$ is $\bar{Q}$ and its covariance with $y$ is
$\bar{R}$. Thus $Q$ and $\bar{Q}$ satisfy the same differential equation and
an analogous argument shows that the same holds for $R$ and $\bar{R}$.

It is interesting to ask whether the above equivalence between ensemble 
and single student behaviour carries over to more general situations. Let 
us first consider allowing interactions between the ensemble members. In 
this case much more complicated scenarios can arise. However, if one only 
considers global and symmetric interactions between the ensemble members, 
an equivalent single student rule will often exist. To be specific assume that 
$f$ may in addition depend on the output of the entire ensemble (3). This just
amounts to allowing $f$ to depend on the random variable $x$, and with only
minor modifications the above construction will again yield an equivalent 
single student rule. 

Next consider more general architectures than the simple perceptron. It is 
straightforward to generalize the construction to the case of a tree committee 
machine: one just has to carry out an integration analogous to (8) per branch
of the tree. The case of the tree parity machine, however, is more involved
since due to a gauge symmetry, students with differing weight vectors can
implement the same function. Thus averaging the output of the ensemble
members (3) may no longer be equivalent to averaging the weight vectors.
But it is straightforward to break the symmetry in a formal way by adding a
small deterministic drift term of the form $B \delta N^{-1}$ to the update
equations (1) of each branch. Then for $\delta > 0$ the same procedure as for
the tree committee will yield an equivalent single student rule. In the end,
one will of course want to take the limit $\delta \to 0$. In this limit,
however, for a training set size which is on the order of the number of free
parameters in a single student, only a trivial generalization behaviour will
result [?]. So this procedure does
not allow us to make any statement about the equivalence between ensemble 
and single student performance for the large training sets needed to achieve a 
nontrivial behaviour. It does, however, show that the pathological divergence 
of the training times which results from the symmetry, cannot be overcome 
by the use of an ensemble. Similar remarks as for the tree parity machine 
apply to fully connected architectures. 

So in sharp contrast to batch learning where ensemble performance is 
often superior to single student performance, in online learning one cannot 
improve on optimal single student performance through an ensemble. But 
obviously if the state space of the learning system were large enough to
store the entire training set, online learning would reduce to the batch case.
So an ensemble may simply not be an effective way of making use of a state
space which is larger than in the case of a single student, and future research
should investigate more efficient strategies of utilizing a large state space.

It is a pleasure to acknowledge helpful discussions with Manfred Opper 
and David Saad. 

Figure 1: Asymptotics of the soft perceptron learning rule. The
generalization error of the ensemble decays as
$\epsilon_e \simeq \gamma_e/\alpha$, and for a single student
$\epsilon_s \simeq \gamma_s/\alpha$. The dependence of $\gamma_e$ and
$\gamma_s$ on the parameter $\lambda$ which controls the softness of the
learning rule via $k \simeq 1 - (\lambda/\alpha)^2$, is shown in the plot. The
learning rate schedule is $\eta \simeq 2\sqrt{2\pi}/\alpha$. For all values of
$\lambda$, this schedule optimizes both single student and ensemble
performance. For $\lambda > 1$ the students in the ensemble correlate quickly
with increasing $\alpha$, and using an ensemble asymptotically yields no
improvement over single student performance.






